4 research outputs found

    AsterixDB: A Scalable, Open Source BDMS

    AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL-style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.

    Analysis-Aware Approach To Entity Resolution

    In the era of big data, in addition to large local repositories and data warehouses, today’s enterprises have access to a very large number of diverse data sources, including web data repositories, continuously generated sensory data, social media posts, clickstream data from web portals, audio/video data capture, and so on. As a result, modern applications increasingly demand up-to-the-minute analysis tasks on top of these dynamic and/or heterogeneous data sources. Such new requirements have created challenging new problems for traditional entity resolution techniques, and for data cleaning techniques in general. In this thesis, we respond to some of these challenges by developing an analysis-aware approach to entity resolution. First, we explore the problem of analysis-aware data cleaning in the context of selection queries. Specifically, we propose an “on-the-fly” data cleaning framework in the context of SQL-like selection queries. The objective of this framework is to perform the minimal number of cleaning steps that are required to answer a user query correctly. Our approach leverages the concept of vestigiality to reduce cleaning overhead. We conducted a comprehensive empirical evaluation of the proposed solution to demonstrate its significant advantage in terms of efficiency over the traditional techniques for the given problem settings. Subsequently, we study analysis-aware data cleaning for the more general case where queries can be complex SQL-style selections and joins. In particular, we develop a framework for integrating entity resolution techniques with query processing. The aim of this framework is to utilize the query semantics to reap the benefits of early predicate evaluation while still minimizing redundant computation in the form of data cleaning. This framework relies on the notion of polymorphic operators, which are analogous to the common relational algebra operators with one exception: they know how to test the query predicates on the dirty data prior to cleaning it. We conducted extensive experiments to evaluate the effectiveness of our approach on real and synthetic datasets. Overall, our experiments demonstrate that our analysis-aware approaches perform significantly better than traditional ER techniques, especially when the query is very selective.

    QuERy: A Framework for Integrating Entity Resolution with Query Processing

    This paper explores an analysis-aware data cleaning architecture for a large class of SPJ SQL queries. In particular, we propose QuERy, a novel framework for integrating entity resolution (ER) with query processing. The aim of QuERy is to correctly and efficiently answer complex queries issued on top of dirty data. The comprehensive empirical evaluation of the proposed solution demonstrates its significant advantage in terms of efficiency over the traditional techniques for the given problem settings.

    QDA: A Query-Driven Approach to Entity Resolution
